Abstract

We review the status of the parallel versions of the computer algebra system FORM. In particular, we give a
brief overview of the historical developments, discuss the strengths of ParFORM and TFORM, and mention typical
applications. Furthermore, we briefly discuss the programs FIRE and FIESTA, which have also been developed within
the Collaborative Research Center/TR 9 (CRC/TR 9).
1. Introduction

The symbolic manipulation of complicated formulae has a long tradition in particle physics. Computer
algebra systems (CAS) were used quite early on, e.g., to evaluate traces over γ matrices. Among the
first CAS are REDUCE [1] by A. Hearn, SCHOONSCHIP [2-4], designed by M. Veltman, ASHMEDAI [5] by
M. Levine, and Macsyma [6], developed at MIT. Later, Mathematica [7], Maple [8] and others were
developed, which are still in use today. However, their field of application is limited to small and
medium-sized problems since they cannot work with very large intermediate expressions. On the other
hand, quite a number of problems produce intermediate expressions ranging from a few hundred gigabytes
up to terabytes, which have to be manipulated by the CAS. The only CAS currently available that can
cope with such tasks is FORM [9, 10].

FORM is a program for the symbolic manipulation of algebraic expressions. It is specialized to handle
very large algebraic expressions of billions of terms in an efficient and reliable way. That is why it
is widely used, in particular in the framework of perturbative Quantum Field Theory, where often
several thousands of Feynman diagrams have to be computed. However, the abilities of FORM are also
quite useful in other fields of science where the manipulation of huge expressions is necessary.

FORM is constructed in such a way that the size of the expressions is not restricted by the main
memory of the computer but only by the space available on hard disk. In addition, its data
representation is very dense compared to other general-purpose systems. In modern applications in
particle physics the intermediate expressions for a single Feynman diagram quite often become huge.
As a consequence, even with FORM such calculations require a CPU time of several years, despite the
steady advancement of the hardware and the continuous improvement of the algorithms. Furthermore, the
available resources in terms of CPU speed, memory and disk space are often not sufficient.

One of the most efficient ways to increase the performance is parallelization, which makes the
resources of several computers available simultaneously and thereby significantly reduces the wall
clock time. In fact, the project to obtain a parallel version of FORM was started at the end of the
nineties. In recent years ParFORM [11] and TFORM [12] have become reliable tools, which are described
in this contribution.

A number of calculations performed within project A1 of the CRC/TR 9 relied on ParFORM and TFORM for
their successful completion [13-27]. In all these cases the single-core CPU time was estimated at
several years. Parallelization reduced the wall clock time to weeks or, at most, months.

As a further application we want to mention Ref. [28], where FORM was used to solve exceptionally
large systems of equations to create mathematical tables for general use in mathematics and physics.

The calculation of three-loop helicity-dependent splitting functions in QCD [29, 30] could also only
be completed thanks to FORM, because expressions of one terabyte or more were no exception and at one
point more than 6 terabytes of disk space were needed for a single diagram.

Within the CRC/TR 9 two concepts for parallel versions of FORM have been successfully developed and
implemented: ParFORM, essentially based on MPI (message passing interface), and TFORM, which uses
threads for the parallelization. Both programs run stably, show a good speedup and are complete in the
sense that all programs written for the serial version of FORM can now be used with ParFORM and TFORM.
Sections 4 and 5 provide details on the parallel versions.

In this project of the CRC/TR 9, programs have also been developed for the reduction of families of
Feynman integrals to a small set of basis elements (master integrals) and for their numerical
evaluation. These two topics are covered by two program packages, FIRE and FIESTA, which are discussed
in Section 6.

We continue this review in Section 2 with some historical remarks concerning the first steps towards
the parallelization of FORM and describe the basic features of FORM in Section 3.

2. Historical remarks 

The first initiatives to parallelize FORM go back to early 1991, when version 1 of FORM was made to
run on a computer at the Fermi National Accelerator Laboratory (FNAL) which was designed for lattice
calculations and had 257 processors. Due to limitations in accessibility this project was
discontinued, but the further development of FORM took this experience into account.

The first systematic study of a parallel version of FORM was performed within the DFG-funded Research
Unit “Quantenfeldtheorie, Computeralgebra und Monte-Carlo Simulation”, which ran from 1996 to 2002 and
can thus be considered a precursor to the CRC/TR 9. In Ref. [31] a first parallel prototype of FORM
was presented, and results of several studies, such as the runtime of the parallel sorting on different
architectures, were shown.

Figure 1: Speedup for the program BAICER on Compaq-AlphaServer with 8 Alpha (EV67) processors with
700 MHz.

One year later, in July 2000, the first “working parallel FORM prototype, ParFORM”, was introduced in
Ref. [32]. It was based on the syntax of a preliminary version of FORM 3, which had not yet been
published at that time. In Ref. [32] the parallelization on clusters was discussed based on the
following hardware:

• Digital workstation cluster (TTP Karlsruhe) running DEC UNIX 4.0D: 8 nodes with 600 MHz
Alpha 21164A (EV56) processors and 512 MB RAM,

• PC cluster (TTP Karlsruhe) running Linux 2.2.13: 4 nodes with 500 MHz Intel Pentium III processors
and 256 MB RAM,

• IBM SP2 (Computing Center Karlsruhe) running AIX 4.2.1: 160 thin P2SC nodes with 120 MHz processors
and 512 MB RAM (256 nodes in total).

In addition to several feasibility studies, results for the speedup of a MINCER [33] job are shown. A
reasonable speedup of 2.5 with four nodes on the PC cluster, a factor of 4.5 with eight nodes on the
Alpha cluster and a factor of 6 with twelve nodes on the IBM SP2 have been reported. As a first
physical application of ParFORM, higher moments of deep inelastic structure functions at
next-to-next-to-leading order of perturbative QCD were computed in Ref. [34].

At a later stage of the Research Unit ParFORM was further developed so that parallel FORM jobs could
also be run on symmetric multiprocessing (SMP) computers (not only on clusters). Fig. 1 shows the
speedup for the test program BAICER, a FORM program developed to compute massless four-loop two-point
integrals within project A1 of the CRC/TR 9, running on a

• Compaq-AlphaServer GS60e, 8 Alpha (EV67) processors (700 MHz).

A speedup of about 4.5 could be achieved using eight processors.

Figure 2: Computing time and speedup for the test program BAICER on the SGI Altix 3700 server with
32x Itanium-2 processors (1.3 GHz).

Figure 3: Graphical representation of the processing of an input expression in FORM.

Two years after the start of the CRC/TR 9 a first version of ParFORM operating on cluster and SMP
architectures was discussed in Ref. [11]. It could run arbitrary FORM programs in parallel and was
based on FORM 3 version 3.1 [9]. At that time there were already a number of applications which would
not have been possible without ParFORM [13, 14, 16, 34].

For the calculations and for the development of ParFORM a 32-core computer was available:

• SGI Altix 3700 server: 32x 1.3 GHz/3 MB-SC Itanium-2 CPUs, 64 GB DDR/116 MHz memory, 2.4 TB SCSI
hard disks.

The results for the test program BAICER are shown in Fig. 2. The speedup is almost linear up to twelve
processors. Afterwards it flattens but is still considerable. A speedup of 12 means that a FORM job
that would need one year of computing time can be run as a ParFORM job in about one month. This leads
to a qualitatively new level, because it is practically impossible to run jobs for years, whereas
months are feasible nowadays. Fig. 2 shows that with 16 processors a speedup of 10 could be reached.
This means that one can run two jobs simultaneously on a 32-processor computer, each with a speedup
of 10.

In Ref. [35] the functionality of FORM and ParFORM was extended and facilities were introduced to
communicate with external resources. This mechanism enables the user to include in FORM programs other
pieces of software, which are used as black boxes to take over certain tasks. A typical example is
Fermat [36], which can efficiently compute the greatest common divisor of multivariate polynomials.


In February 2007 TFORM [12], based on POSIX threads, was released, a further major step in the
development of parallel FORM versions. For later developments and further comparisons between ParFORM
and TFORM we refer to the proceedings contributions [37-40] and to Sections 4 and 5.

The more recent developments concern the release of FORM 4.0 [10] and the inclusion of tools to
generate optimized code [41], which is used as input in FORTRAN or C programs for numerical
integrations.

3. Sequential version of FORM 

This article is not intended as an introduction to FORM 
or even a reference manual. Nevertheless we want to 
describe the basic features which are important in the 
context of parallelization. 

A FORM program is in general divided into so-called modules, which are terminated by a “dot”
instruction. During the execution of the program, which is only possible in batch mode, the modules are
processed separately one after the other. The processing of each module essentially occurs in three
steps:

• Compilation: The input is translated into an internal representation.

• Generating: For each term of the input expressions the statements of the module are executed. This
in general generates a lot of terms.

• Sorting: All the output terms that have been generated are sorted and equivalent terms are summed
up.

This is illustrated in Fig. 3.

The fundamental objects manipulated by FORM commands are expressions, which are viewed as sums of
individual terms (see also Fig. 3). Next to a sophisticated pattern matcher, it is the strength of FORM
that only local operations on single terms are allowed, like replacing parts of a term by some other
expression. Non-local operations like replacing a sum of two terms are not allowed.

L expr = a*x + x^2;
id x = a + b;
.sort
    Generating:  +a^2 +a*b +a^2 +a*b +a*b +b^2
    Sorting:     +2*a^2 +3*a*b +b^2
if (count(b,1)==1);
  multiply 4*a/b;
endif;
print;
.end
    Generating:  +2*a^2 +12*a^2 +b^2
    Sorting:     +14*a^2 +b^2

Figure 4: Example for the generation and sorting of data in FORM.

For example, the command identify (short: id) identifies the left-hand side with the right-hand side
and can be used as

id a = b + c; 

On the other hand, the usage 

id a + b = c; 

would lead to an error message. 

Non-local operations are allowed only implicitly, e.g., in the sorting procedure at the end of the
modules, where equivalent terms are combined. At first sight this seems to be a strong limitation for
the formulation of general and efficient algorithms. However, it is usually possible to get around
this limitation by designing algorithms in clever and non-standard ways.

Due to the locality of the operations it is possible 
to handle expressions as “streams” of terms that can 
be read sequentially from the memory or a file and 
processed independently. This enables FORM to deal 
with expressions that are larger than the available main 
memory. 

An example illustrating the principal operating mode of a FORM program is shown in Fig. 4. It
corresponds to the simple program

Symbols a, b, x;
L expr = a*x + x^2;
id x = a + b;
.sort
if (count(b,1)==1);
  multiply 4*a/b;
endif;
print;
.end
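
The generate-and-sort cycle of this example can be mimicked in a few lines of Python. This is only an illustrative sketch and says nothing about how FORM is implemented internally: the term representation (a coefficient plus the powers of a, b and x) and the helper names id_x, if_multiply, sort_terms and run_module are assumptions made for this example.

```python
from itertools import product

# Hypothetical toy model of a FORM module: generate terms, then sort.
# A term is (coefficient, (power of a, power of b, power of x)).

def id_x(term):
    """Mimic `id x = a + b;`: replace every factor x by a or by b."""
    coeff, (pa, pb, px) = term
    generated = []
    for choice in product("ab", repeat=px):
        powers = (pa + choice.count("a"), pb + choice.count("b"), 0)
        generated.append((coeff, powers))
    return generated

def if_multiply(term):
    """Mimic `if (count(b,1)==1); multiply 4*a/b; endif;`."""
    coeff, (pa, pb, px) = term
    if pb == 1:
        return [(4 * coeff, (pa + 1, pb - 1, px))]
    return [term]

def sort_terms(terms):
    """Sum the coefficients of equal monomials and drop vanishing terms."""
    summed = {}
    for coeff, powers in terms:
        summed[powers] = summed.get(powers, 0) + coeff
    ordered = sorted(summed.items(), reverse=True)
    return [(coeff, powers) for powers, coeff in ordered if coeff != 0]

def run_module(expression, statement):
    """Apply `statement` to each term independently, then sort once."""
    generated = [new for term in expression for new in statement(term)]
    return sort_terms(generated)

expr = [(1, (1, 0, 1)), (1, (0, 0, 2))]               # a*x + x^2
after_first = run_module(expr, id_x)                  # 2*a^2 + 3*a*b + b^2
after_second = run_module(after_first, if_multiply)   # 14*a^2 + b^2
print(after_second)
```

Each statement only ever sees one term at a time, which is the locality property discussed above; equal terms meet only in sort_terms.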


[Figure 5 schematic: the master splits the expression a*x + x^2 + b*x + ... into chunks, sends chunk I
to Worker I and chunk II to Worker II, then performs the final sorting, outputs the result and goes to
the next module.]

Figure 5: General conception of ParFORM.


4. ParFORM 

4.1. The concept of ParFORM 

As mentioned above, the locality principle on the one hand enables FORM to deal with expressions that
are larger than the available main memory; on the other hand it also allows for parallelization. The
concept implemented in ParFORM is straightforward and indicated in Fig. 5: in a first step the master
process splits the expression into pieces, so-called chunks. Each chunk is sent to one of the workers,
where an independent FORM process runs, i.e. the module to be executed is compiled, and the terms are
generated, sorted and sent back to the master. Once all worker processes have finished their jobs the
master performs the final sorting.
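
The master/worker scheme just described can be sketched as follows. This is a toy model under stated assumptions, not ParFORM itself: Python threads stand in for the independent worker processes, no MPI is involved, and the names id_x, worker and master as well as the term representation are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby, product

# Toy model: a term is (powers, coefficient) with powers = (a, b, x).

def id_x(term):
    """Mimic `id x = a + b;` on a single term."""
    (pa, pb, px), coeff = term
    return [((pa + c.count("a"), pb + c.count("b"), 0), coeff)
            for c in product("ab", repeat=px)]

def worker(chunk):
    """Generate the terms of one chunk and sort them locally."""
    summed = {}
    for term in chunk:
        for powers, coeff in id_x(term):
            summed[powers] = summed.get(powers, 0) + coeff
    return sorted(summed.items())

def master(expression, n_workers=2, chunk_size=2):
    """Split into chunks, let the workers process them, do the final sort."""
    chunks = [expression[i:i + chunk_size]
              for i in range(0, len(expression), chunk_size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        sorted_parts = list(pool.map(worker, chunks))
    # Final sorting: merge the pre-sorted worker outputs and add equal terms.
    merged = heapq.merge(*sorted_parts, key=lambda t: t[0])
    result = []
    for powers, group in groupby(merged, key=lambda t: t[0]):
        total = sum(coeff for _, coeff in group)
        if total != 0:
            result.append((powers, total))
    return result

expr = [((1, 0, 1), 1), ((0, 0, 2), 1), ((0, 1, 1), 1)]   # a*x + x^2 + b*x
parallel_result = master(expr)
print(parallel_result)   # coefficients of b^2, a*b and a^2
```

Because each worker returns an already sorted list, the master only has to merge, which is why the final step is cheaper than sorting all generated terms from scratch.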

The communication between master and workers is based on the message passing interface (MPI) standard
[42], which provides a library for the data transfer between processes. Message passing makes it
possible to parallelize FORM both on computer architectures with shared memory, i.e. SMP computers,
and on computer clusters. The way the master communicates with the workers is sketched in Fig. 6.

It is worth mentioning that the parallelization does not require any additional effort from the user.
Programs written for the sequential version can be run with ParFORM by merely adding a specification
of the number of processors. Clearly, different codes show different performance and efficiency in the
parallel version. In particular, modules in which the outcome depends on the order in which the terms
are processed cannot be parallelized and are executed in sequential mode. This mostly concerns the use
of the dollar variables, which were introduced in version 3. In case FORM switches to sequential mode
although this is not actually needed, the user can add an extra statement to overrule the decision and
tell FORM how to deal with the ‘dubious’ case.

[Figure 6 schematic: a job started with "mpirun -np 3 parform ..." runs a master (process 0) and two
workers (processes 1 and 2). The master sends the input data to worker 1 and worker 2, the workers
process it, and the master collects and sorts their output into the result.]

Figure 6: Visualization of the mode of operation of ParFORM based on MPI.

Figure 7: Mode of operation implemented into ParFORM for a NUMA architecture.

4.2. ParFORM on a NUMA architecture 

The SGI Altix computer is realized with a so-called NUMA architecture, where NUMA stands for
non-uniform memory access. This means that the individual processors have faster access to some parts
of the main memory than to others. A specialized version of ParFORM has been developed which exploits
this feature and, at the same time, avoids MPI and the overhead connected with it. The corresponding
scheme of operation is illustrated in Fig. 7.

Using the specialized version of ParFORM in connection with the 32-core SGI Altix, a considerable
improvement of the speedup could be obtained, as can be seen in Fig. 8. In fact, for 16 processors the
speedup improved from 8 to 10, for 32 processors from 10 to 13 (see also the discussion in the next
subsection).

4.3. ParFORM on clusters and multi-core nodes 

At present, there are a number of calculations of physical quantities which would not have been
possible without the gain in performance and speedup provided by ParFORM (see, e.g., Refs. [14, 34]).
Most of the applications are connected to the evaluation of four-loop Feynman integrals which occur in
the context of perturbative quantum field theory. In particular, there are algorithms which trade the
mathematical complexity of the original problem for simple manipulations of rather large polynomial
expressions with billions of terms or even more. Manipulations of this type constitute the basis of
the speedup curves which are discussed in the following.

The results for the test program running on an SGI Altix 3700 server with 32 Itanium-2 processors are
shown in Fig. 8, where both the runtime and the speedup (as compared to the sequential version) are
shown as a function of the number of processors, p, involved in the calculation. The almost horizontal
line between p = 1 and p = 2 is due to the fact that for p = 2 one of the processors takes over the
role of the master and the other one that of the worker. Thus a real reduction of the CPU time only
starts from p = 3. It is interesting to note that the speedup is almost linear up to twelve processors.
Furthermore, for 16 processors the program is faster by an order of magnitude. As a consequence,
instead of years one only has to wait a few months in order to obtain the results of a calculation.
This provides the possibility to consider qualitatively new kinds of problems, since in practice it is
impossible to run a job for years whereas a few months are feasible nowadays. Beyond p = 16 the curve
becomes flatter; however, the speedup is still considerable up to 32 processors.

Figure 8: Runtime and speedup for the test program BAICER running on an SGI Altix 3700 server with 32
Itanium-2 processors (1.3 GHz), as a function of the number of processors p. The lower curve
corresponds to the MPI version and the upper one to the shared memory version of ParFORM.
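
The arithmetic behind statements such as "months instead of years" can be made explicit with a small helper. The sketch below is only an illustration: the function names efficiency and wall_clock_days are hypothetical, and the input numbers are the rounded values quoted in the text (a speedup of about 10 for 16 processors).

```python
def efficiency(speedup, processors):
    """Parallel efficiency: achieved speedup per processor used."""
    return speedup / processors

def wall_clock_days(serial_days, speedup):
    """Wall clock time of the parallel run for a given serial run time."""
    return serial_days / speedup

# Rounded values quoted in the text: speedup of about 10 with p = 16.
print(round(efficiency(10, 16), 3))         # 0.625
print(round(wall_clock_days(365, 10), 1))   # 36.5

# If one of the p processors acts as the master, as discussed for p = 2,
# only p - 1 processors do the actual work:
print(round(efficiency(10, 16 - 1), 3))     # 0.667
```

A job that would take one year sequentially thus finishes in a little more than five weeks, at the price of leaving roughly a third of the processor capacity unused.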

The latest speedup plot for (the MPI version of) ParFORM is shown in Fig. 9, where BAICER is running
on the cluster ttpmoon, which has the following configuration:

• Computer cluster (TTP Karlsruhe) running Linux, 8 nodes with 2 hexa-core Intel Xeon X5675 (3.07
GHz), 96 GB RAM, and 3.6 TB local hard disk (RAID 0 with 6 stripes), interconnected by QDR InfiniBand.

The top plot shows the time used in minutes as a function of the number of CPUs involved (including
the master), and the bottom plot shows the speedup compared to the serial version (see footnote 1). It
is interesting to note that a speedup of about 10 is reached when 16 CPUs are used, a value obtained
in Fig. 8 for the shared-memory version, which avoids the use of MPI, cf. Subsection 4.2. For a higher
number of CPUs the curve flattens but nevertheless reaches a speedup above 20 for 96 CPUs.

4.4. ParFORM on “low-level” clusters

ParFORM has been successfully installed on several clusters. In Fig. 10 the corresponding speedup
curves are shown and compared to the curve from Fig. 8 obtained on the SMP computer. The cluster XC6000
is a Hewlett Packard Itanium-2 cluster interconnected by QsNet. This is the only tested cluster which
shows a better behaviour than the SMP computer; however, it is also significantly more expensive.
Fphctl is a cluster consisting of 32-bit Xeon nodes. This cluster has been tested both with an
InfiniBand (FphctlIB) and a simple Fast Ethernet (FphctlEN) interconnection. Whereas the latter is not
of interest in practice, the former shows quite reasonable behaviour, closely following the SMP curve
for a smaller number of processors. Plejade and Empire are both dual Opteron clusters. However, Plejade
is interconnected using InfiniBand whereas Empire uses Gigabit Ethernet. Both clusters show reasonable
behaviour, leading to a speedup of about six for ten processors.

Figure 9: Timing and speedup plot for the ParFORM benchmark job BAICER running on ttpmoon, as a
function of the number of CPUs.

1 Note that there is no data point for two CPUs; otherwise one would observe a flat behaviour between
one and two CPUs, and only then would the curve start to rise.

We want to mention that the SMP curves shown in Fig. 10 are based on the shared-memory model mentioned
above. On the other hand, for the clusters one has to rely on the MPI library, which has a significant
overhead for our applications.

5. TFORM

In the last decade multi-core processing has become a key technology in the computing industry, since
improving system performance through increasing clock rates of single-core processors is hindered by
physical limits. From laptops to supercomputers multi-core processors are in widespread use, and
modern operating systems allow one to easily use them as SMP computers. Although ParFORM works on such
SMP computers, interprocess communication between the master and the workers via MPI can have a
significant overhead when gigantic expressions are transferred.

Figure 10: The speedup for the test program on different clusters in comparison to the SMP computer
(cf. Fig. 8), as a function of the number of processors.

Figure 11: Mode of operation for TFORM.

This overhead problem can be overcome on SMP architectures with the help of another model for the
communication. In this approach the master explicitly allocates shared memory buffers which can be
accessed both by the master and by the workers. In these memory segments the master prepares the
chunks for the workers, the workers do their job, and the master collects the results again from the
shared buffers. Thus, copying huge amounts of data is no longer necessary. The use of the
shared-memory model on SMP machines led to an increase in the speedup of 20-25% (cf. Fig. 8). This
concept is taken even further in TFORM [12], a multithreaded version of FORM.

In TFORM the implementation uses the POSIX threads library, which is available on all modern UNIX
systems and is therefore portable. The way the master communicates with the workers is sketched in
Fig. 11. TFORM starts with one master thread and N worker threads in a so-called thread pool. The
workers sleep until the master assigns tasks, and hence do not spend any CPU time. When the master has
a task to be distributed over the workers, it wakes up one of the sleeping workers and assigns the
task to it. Terms in expressions, grouped into chunks to reduce the overhead, are distributed in this
way. After distributing all terms, the master waits for all the workers to finish their tasks and then
merges the results of the workers in a final sorting operation. The data transfer among the threads is
done via the shared memory buffers, using memory locks for the synchronization between the master and
the workers (see Fig. 11).
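
The thread-pool pattern described above (sleeping workers, a shared buffer, and a lock for synchronization) can be sketched with Python's standard threading and queue modules. This is a simplified illustration and not TFORM's implementation: here the workers pick up chunks from a common queue instead of being woken individually by the master, and the names run_pool and process_chunk are assumptions made for this example.

```python
import queue
import threading

def run_pool(chunks, process_chunk, n_workers):
    """Master/worker scheme with sleeping workers and a shared result buffer."""
    tasks = queue.Queue()        # the master puts chunks here
    results = []                 # buffer shared by all threads
    lock = threading.Lock()      # protects the shared buffer

    def worker():
        while True:
            chunk = tasks.get()  # blocks (sleeps) until work arrives
            if chunk is None:    # sentinel: no more work
                return
            partial = process_chunk(chunk)
            with lock:
                results.append(partial)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for chunk in chunks:         # the master distributes the chunks
        tasks.put(chunk)
    for _ in threads:            # one sentinel per worker
        tasks.put(None)
    for thread in threads:       # the master waits for all workers
        thread.join()
    return results

chunks = [[1, 2, 3], [4, 5], [6]]
partial_sums = run_pool(chunks, lambda chunk: sum(n * n for n in chunk), n_workers=2)
total = sum(partial_sums)        # stands in for the master's final merge
print(total)
```

Since all threads share one address space, the chunks and the partial results are never copied between processes, which is the advantage over the MPI-based scheme.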

Due to the model for the communication, some features improving the performance are relatively easy to
implement in TFORM, whereas their implementation is difficult in ParFORM. One of them is a load
balancing system. If a single worker is assigned terms requiring much CPU time, the master may have to
wait for this worker before the final sorting, even after the other workers have finished their tasks
and become idle. To avoid such inefficiency, after distributing all terms to be processed, the master
looks for idle workers. If such workers are found, terms are stolen back from the chunks of workers
that are still busy and redistributed over the idle workers. Experiments with an even more
fine-grained load balancing were unsuccessful, because they resulted in too much overhead.

Another feature in TFORM concerns the parallel sorting. In the final sorting, TFORM used to adopt the
simple model in which the master merges the outputs from all the workers simultaneously. It therefore
often happened that the master was busy while the workers were waiting for the master to accept the
next chunks of their results. This becomes a bottleneck, especially when the number of workers is
large. To alleviate this bottleneck, an improved model of the final sorting has been implemented in
TFORM. In this model, each pair of workers sends its results to a special worker thread, called a
sortbot, which merges the results. Then each pair of sortbots sends its results to another sortbot.
This continues until the last two sortbots send their results to the master, which merges the final
two results and writes the result to disk. This is illustrated in Fig. 12. Because this method also
still involves much waiting, a run with N workers will rarely use more than the CPU time provided by N
cores, even when the computer has many more cores. The total wall clock execution time improves
measurably with this method, although at the cost of extra memory needed for the buffers of the
sortbots.

Figure 12: Illustration of the mode of operation of sortbots.
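
The pairwise merging performed by the sortbots amounts to a binary merge tree. The following sketch shows only the data flow, executed sequentially; in TFORM the merges of one level run concurrently in separate threads. The function names merge_two and sortbot_tree and the representation of a term as a (key, coefficient) pair are assumptions made for this illustration.

```python
def merge_two(left, right):
    """Merge two sorted term lists, adding coefficients of equal keys."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        (key_l, coeff_l), (key_r, coeff_r) = left[i], right[j]
        if key_l == key_r:
            if coeff_l + coeff_r != 0:      # cancelling terms drop out
                merged.append((key_l, coeff_l + coeff_r))
            i += 1
            j += 1
        elif key_l < key_r:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

def sortbot_tree(worker_outputs):
    """Merge the outputs pairwise, level by level, until one list is left."""
    level = list(worker_outputs)
    if not level:
        return []
    while len(level) > 1:
        next_level = [merge_two(level[k], level[k + 1])
                      for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:                  # an unpaired list moves up unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0]

outputs = [
    [("a", 1), ("c", 2)],      # worker 1
    [("a", 1), ("b", 1)],      # worker 2
    [("b", -1), ("d", 1)],     # worker 3
    [("c", 1)],                # worker 4
]
final = sortbot_tree(outputs)
print(final)
```

With four workers the first level corresponds to two sortbots and the last merge to the master, so no single thread has to merge more than two streams at a time.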

Fig. 13 shows up-to-date timing and speedup plots for ParFORM and TFORM running on ttpmoon (see
footnote 2). Note that the cluster ttpmoon consists of 12-core nodes, which explains the end point of
the TFORM curves, where a speedup better than 9 is reached. For 12 CPUs, which means 1 master and 11
workers, ParFORM reaches a speedup of 8.


6. Further developments within CRC/TR 9 

6.1. Reduction to master integrals with FIRE 

Nowadays the vast majority of calculations of higher-order quantum corrections involve a huge number
(sometimes exceeding several millions) of different contributing integrals. The standard way to reduce
their number to a manageable amount is based on the so-called “Laporta algorithm”, which is described
in Ref. [43]. There are many different implementations of this algorithm; some of them are publicly
available, like AIR [44] or Reduze [45, 46], others are private, like crusher [47], which has been
developed in the context of project A1 of the CRC/TR 9. Within project A2 the program FIRE [48-51] has
been developed.

Figure 13: Comparison of timing and speedup for ParFORM and TFORM running on ttpmoon, as a function of
the number of CPUs.

2 The ParFORM curves are already shown in Fig. 9.

FIRE stands for Feynman Integral REduction and implements a special version of the Gauss elimination
method to solve the system of linear equations, which is generated by the application of the
integration-by-parts relations [52], for the master integrals. It uses several external programs, like
Snappy [53] for data compression, KyotoCabinet [54] as a database to store data on disk, Fermat [36]
for algebraic simplifications, and LiteRed [55] to retrieve additional rules among integrals.

The operation of FIRE is divided into two parts. In a first step the input for the reduction step is
prepared within Mathematica. This includes the generation of all integration-by-parts relations, the
generation of symmetry relations, the identification of the sectors of indices where integrals vanish,
i.e. the so-called boundary conditions, and the preparation of a list of integrals which shall be
reduced. The second step is significantly more time-consuming. In the latest version, FIRE5 [50], this
part is written in C++. Here the systematic reduction to master integrals is performed. The output is a
table for the list of integrals provided in part one.

Figure 14: Integral family with three massive (double lines), three massless lines and an irreducible
numerator (not shown). It is evaluated in forward-scattering kinematics, i.e., there is one external
momentum, $p_1$, flowing through the upper massless line and another, $p_2$, through the lower massive
lines.

To demonstrate the use of FIRE let us, for example, consider the integral family of Fig. 14, which has
three massive internal lines (with mass $m$). For the external momenta we have $p_4 = p_1$,
$p_3 = p_2$ with $p_1^2 = p_2^2 = 0$. Integrals of that type contribute to the next-to-leading order
corrections to double-Higgs boson production. In fact, the imaginary part, which is a function of
$x = m^2/s$ (with $s = (p_1 + p_2)^2$), is related to the total cross section via the optical theorem.
The input for the Mathematica part of FIRE contains the following elements (for a detailed description
of the commands we refer to Ref. [50]):

(* load FIRE: *)
Get["FIRE5.m"];

(* define integral family: *)
Propagators = {m^2 - (v1-v2)^2,
  m^2 - (p2-v1)^2, m^2 - (p2-v2)^2,
  -v2^2, -v1^2, -(p1+v1)^2, -(p1+v2)^2};
Internal = {v1,v2};
External = {p1,p2};

(* IBP relations: *)
PrepareIBP[];
kinset = {p1^2 -> 0, p2^2 -> 0,
  p1*p2 -> s/2};
set1 = Internal;
set2 = Join[Internal,External];
ncount = 0;
startinglist = {};
For[ii=1,ii<=Length[set1],ii++,
  For[jj=1,jj<=Length[set2],jj++,
    ncount = ncount + 1;
    ff[ncount] =
      IBP[set1[[ii]], set2[[jj]]] /. kinset;
    startinglist =
      Join[startinglist,{ff[ncount]}];
  ];
];

(* boundary conditions: only contributions
   with cuts through at least 2 Higgs
   lines are kept: *)
(RESTRICTIONS = { {0,-1,0,0,0,-1,0},
  {0,0,-1,0,0,0,-1},{0,0,0,0,0,-1,-1},
  {-1,0,0,0,0,0,0},{-1,-1,0,0,0,0,0},
  {-1,0,-1,0,0,0,0},{0,-1,-1,0,0,0,0} });
SYMMETRIES = { {1,3,2,5,4,7,6} };
Prepare[];

(* save data to top212hla.start: *)
SaveStart["top212hla"];

The last command writes all generated information into the so-called “start” file which, together with
the list of integrals, serves as input for the reduction step. The steering file, top212hla.config,
for the latter has the following form

#threads    4
#variables  d, s, m
#start
#problem    1 |7|top212hla.start
#integrals  top212hla.ind
#output     top212hla.tab

where we refer to Ref. [50] for the precise meaning of the individual commands. The integrals which
shall be reduced can be found in the file top212hla.ind, which might have the form

{{1, {1, 1, 1, 2, 2, 2, 2}},
 {1, {1, 1, 1, 1, 1, 1, 2}},
 {1, {1, 1, 1, 1, 1, 1, -1}}}


Here the individual entries are lists in which the integer in the first entry numbers the family and
the second entry contains seven integers specifying the indices of the propagators in the order given
above (see “Propagators”). The reduction is initiated with the command ./FIRE5 -c top212hla. After the
job is completed the reduction table can be found in the file top212hla.tab, which can be read in
again in a Mathematica session of FIRE.
There are several benchmark calculations which have been performed with the help of FIRE. Among them is
the reduction of all three-loop integrals needed for the static potential [56-58], which involves eight
indices for massless relativistic propagators and in addition three indices for static propagators of
the form $1/k_0$. The case of a general QCD gauge parameter $\xi$ poses a particular challenge: it
involves about 20 million integrals, 60 times as many as the $\xi = 0$ case. A further reduction
problem involves the four-loop on-shell integrals needed for the relation between the
$\overline{\rm MS}$ and on-shell quark mass or for the electron anomalous magnetic moment (see, e.g.,
Ref. [59]).

6.2. Numerical evaluation of master integrals with FIESTA

FIESTA [60-62] stands for Feynman Integral Evaluation
by a Sector decomposiTion Approach and is a
convenient tool to numerically evaluate Feynman integrals
using the method of sector decomposition. The latter
is an algorithmic procedure to extract the $\epsilon$ poles
of a given Feynman integral in the so-called alpha
representation and to provide an integral representation for
the coefficients. After the pioneering work of Binoth
and Heinrich [63, 64] several programs have been published
in which different strategies have been implemented.
Among them are sector_decomposition [65],
secdec [64, 66, 67] and FIESTA [60-62].
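
The idea of sector decomposition can be illustrated with a toy
integral that is not taken from this article:
I(eps) = int_0^1 dx int_0^1 dy (x+y)^(-2+eps). The following Python
sketch splits the integration region into the sectors y < x and
x < y, factorizes the singularity and evaluates the remaining finite
parameter integrals numerically. The helper midpoint and the choice
of the integral are assumptions made purely for illustration; FIESTA
itself uses far more elaborate strategies.

```python
import math


def midpoint(f, n=20000):
    """Midpoint-rule estimate of the integral of f over [0, 1]."""
    h = 1.0 / n
    return h * sum(f((k + 0.5) * h) for k in range(n))


# Sector y < x: substitute y = x*t, so that dy = x dt and
#   (x + y)^(-2+eps) dy = x^(-1+eps) * (1+t)^(-2+eps) dt.
# The sector x < y gives the same contribution by symmetry (factor 2).
# The x integral is int_0^1 x^(-1+eps) dx = 1/eps, which is the pole.
# The t integral is finite and can be expanded in eps:
#   (1+t)^(-2+eps) = (1+t)^(-2) * (1 + eps*log(1+t) + ...).
pole_coeff = 2.0 * midpoint(lambda t: (1.0 + t) ** -2)
finite_coeff = 2.0 * midpoint(lambda t: math.log(1.0 + t) * (1.0 + t) ** -2)

# Closed form for comparison: I(eps) = 1/eps + (1 - log 2) + O(eps).
print(pole_coeff, 1.0)
print(finite_coeff, 1.0 - math.log(2.0))
```

Each coefficient of the expansion is thus given by an ordinary finite
integral, which is the form that is handed to a numerical integrator.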

The basic philosophy of FIESTA is that all kinematic
variables are specified at an early stage, which is
different from other approaches like, e.g., secdec, where
generic manipulations are performed up to a certain
point and only then numerical values for masses and
momenta are specified.

The use of FIESTA splits into two steps. In the first
step the momentum integrals are transformed into the
alpha representation and the sector decomposition
algorithm is applied. The corresponding manipulations
are performed in Mathematica and can be done in
parallel mode. For many applications this step is quite
fast. Quite often, however, in particular at higher loop
order, huge expressions are generated which require main
memory in the range of a hundred gigabytes. In such cases
it is convenient to store the results in a database,
since in general this step has to be performed only once.

The second step is concerned with the numerical
integration. In principle this can also be performed within
Mathematica, which is advantageous for small problems
or during the development phase of the program.
Complicated problems have to be integrated with the
help of a C++ integrator which is based on the Cuba
library [68, 69]. It uses the expressions stored in the
database during step one, which provides several
advantages. For example, it is possible to perform various runs
choosing different values for the number of points used
for the integrations. Furthermore, it is possible to copy
the output of step one to a platform which is suitable for
the numerical integration in massively parallel mode.

Figure 15: Sample on-shell Feynman diagram where solid and
dashed lines denote massive and massless lines, respectively.

As an example, let us consider the Feynman diagram
in Fig. 15, which enters the four-loop relation
between the $\overline{\rm MS}$ and on-shell quark masses.
Executing the Mathematica file

Get["FIESTA3.m"];
NumberOfSubkernels=8;
NumberOfLinks=8;
UsingC=True;
UsingQLink=True;
ComplexMode=False;
SDEvaluate[ UF[ {k1,k2,k3,k4},
                {-(k1+q1)^2+m^2,
                 -(k3+q1)^2+m^2,
                 -(k1-k2)^2+m^2,
                 -(k2-k3)^2+m^2,
                 -(k1-k4)^2+m^2,
                 -k4^2+m^2,
                 -k3^2},
                {m->1,q1^2->1}],
            {1,1,1,1,1,1,1},6]

both prepares the integrand and performs the numerical
integration using the corresponding C routines in the
background. The result which is printed on the screen
reads

-276.907674 - 0.625006/ep^4 -
4.937615/ep^3 + (-24.441689 +
0.002*pm69)/ep^2 + (-85.919995 +
0.015937*pm70)/ep + 0.083469*pm71 +
ep*(-864.271585 + 0.468742*pm72) +
ep^2*(-1503.357843 + 2.093833*pm73) +
ep^3*(-6224.681821 + 9.755544*pm74) +
ep^4*(11328.088699 + 40.591518*pm75) +
ep^5*(-18622.607506 + 176.767061*pm76) +
ep^6*(537473.776134 + 713.790523*pm77)

Figure 16: Speedup of the calculation of the various $\epsilon$
orders of the Feynman diagram given in Fig. 15 using 64 (dotted),
128 (dashed), 256 (dash-dotted) and 512 (solid) cores, normalized
to the 32-core run.

The symbols pm indicate the uncertainty due to
the Monte Carlo integration. If the option
OnlyPrepare = True; is added to the Mathematica
file, the integrand is prepared and stored to disk.
Furthermore, the command which invokes the numerical
integration from the shell, without reference to
Mathematica, is printed on the screen.
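
To get a feeling for the size of the Monte Carlo uncertainties, one
can compare each pm coefficient with the corresponding central value.
The following Python sketch does this for the numbers printed above.
The dictionary is typed in by hand from that output; pairing the
constant term with the pm71 coefficient is an assumption about how
Mathematica orders the terms. The two leading poles carry no pm term
in the output and are therefore omitted.

```python
# power of ep -> (central value, coefficient of the pm symbol)
coefficients = {
    -2: (-24.441689, 0.002),
    -1: (-85.919995, 0.015937),
    0: (-276.907674, 0.083469),   # assumed pairing with pm71
    1: (-864.271585, 0.468742),
    2: (-1503.357843, 2.093833),
    3: (-6224.681821, 9.755544),
    4: (11328.088699, 40.591518),
    5: (-18622.607506, 176.767061),
    6: (537473.776134, 713.790523),
}

# relative uncertainty = |pm coefficient / central value|
relative = {n: err / abs(val) for n, (val, err) in coefficients.items()}

for n in sorted(relative):
    print("ep^%d: relative uncertainty %.2e" % (n, relative[n]))
```

With these numbers the relative uncertainty stays below one percent
for every listed coefficient and is largest for the ep^5 term.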

The results from runs performed at the High Performance
Computing Center Stuttgart (HLRS) are shown in
Fig. 16, which displays the speedup for the individual
$\epsilon^n$ terms ($n = -3, \ldots, 6$). The blue (dotted),
green (dashed), red (dash-dotted) and black (solid) curves
(from bottom to top) correspond to the use of 64, 128,
256 and 512 cores, where the results have been normalized
to the 32-core run. It is interesting to note that an
ideal speedup is obtained for 64 cores. For 128 cores
the curve is also close to the maximal value of 4. Using
256 instead of 32 cores still shows quite flat behaviour,
with a speedup between 6 and 7. Strong variations in the
speedup are observed for 512 cores. The relatively low
value for $1/\epsilon^3$ can probably be explained by the
fact that the expression to be integrated is too simple.
On the other hand, for the (complicated) expression of the
$\epsilon^6$ coefficient, disk access might become the
bottleneck.
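
Since all curves are normalized to the 32-core run, the ideal speedup
for N cores is N/32. The short Python sketch below makes this
normalization explicit and converts a measured speedup into a parallel
efficiency. The function names are illustrative; the only numbers
taken from the text are the core counts and the speedup range of 6 to
7 quoted for 256 cores.

```python
BASELINE_CORES = 32  # all speedups are quoted relative to this run


def ideal_speedup(cores, baseline=BASELINE_CORES):
    """Speedup expected for perfect scaling relative to the baseline."""
    return cores / baseline


def efficiency(measured_speedup, cores, baseline=BASELINE_CORES):
    """Fraction of the ideal speedup that is actually reached."""
    return measured_speedup / ideal_speedup(cores, baseline)


for cores in (64, 128, 256, 512):
    print(cores, "cores: ideal speedup", ideal_speedup(cores))

# Speedup between 6 and 7 on 256 cores, as quoted in the text.
print(efficiency(6.0, 256), efficiency(7.0, 256))
```

A speedup between 6 and 7 on 256 cores thus corresponds to an
efficiency between 75% and 87.5% relative to the 32-core run.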

The main purpose of FIESTA is the fast and convenient
cross-check of analytic calculations. Within
CRC/TR 9 it has been applied in this way to several
problems. An early version of FIESTA has been used to
cross-check the master integrals which contribute to the
three-loop static potential [56-58]. Furthermore, thirteen
four-loop on-shell integrals contributing to the
$\overline{\rm MS}$-on-shell quark mass relation and to the
muon anomalous magnetic moment, which have been computed
analytically in Ref. [59], have been cross-checked numerically
with FIESTA. Recently, analytic results for master
integrals of double-box topologies in the physical region
have also been cross-checked with the help of FIESTA [70].

There are also several projects where FIESTA has
been used to evaluate the most complicated or even the
major part of the master integrals numerically. For
example, in the first calculation of the three-loop
corrections to the quark and gluon form factors [22] (see also
Ref. [71]) one coefficient in the $\epsilon$ expansion of the
three most complicated integrals could not be evaluated
analytically. Thus, the numerical results of FIESTA have
been used which, for all practical purposes, lead to final
results with sufficient precision. The analytic calculation
of the missing master integrals has been performed
in Ref. [72], and perfect agreement with the numerical
result has been found.

For the calculation of the three-loop matching coefficient
between QCD and non-relativistic QCD (NRQCD)
of the vector current even the majority of
the about 100 master integrals have been computed
numerically with the help of FIESTA. In such cases it is
important to perform strong cross-checks. One of them
is a change of the parametrization of the individual
integrals, so that in intermediate steps different
expressions are generated which are then integrated
numerically. Furthermore, it is possible to choose a different
integral basis and evaluate the new integrals again with
the help of FIESTA. The agreement of the final expression
within the numerical uncertainty between the two sets
of master integrals serves as a strong check of the
applicability of FIESTA.
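
A comparison "within the numerical uncertainty" can be made
quantitative in several ways. The Python sketch below shows one common
choice, which is an assumption and not necessarily the criterion used
in the calculations cited here: the two uncertainties are combined in
quadrature and the difference is required to be smaller than a chosen
multiple of the combined uncertainty. The function name agree and the
numbers in the example are hypothetical.

```python
import math


def agree(value_a, err_a, value_b, err_b, n_sigma=3.0):
    """Check whether two numerical results agree within n_sigma.

    The uncertainties are treated as independent and combined in
    quadrature (an assumption made for this illustration).
    """
    combined = math.hypot(err_a, err_b)
    return abs(value_a - value_b) <= n_sigma * combined


# Hypothetical results for the same quantity from two integral bases.
print(agree(1.000, 0.002, 1.003, 0.002))   # difference well within 3 sigma
print(agree(1.000, 0.001, 1.010, 0.001))   # difference far outside 3 sigma
```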


7. Summary 

The computer algebra program FORM is designed to
handle huge expressions in a quite effective way. Still,
for some physical applications even FORM would take
several years, which makes a practical calculation
impossible.

In recent years, parallel versions of FORM,
ParFORM and TFORM, have been developed, and in the
meantime they have become reliable tools for performing
computer algebra in parallel. ParFORM has demonstrated
good speedup behaviour both on SMP computers
and on different cluster architectures. Furthermore,
for the current version of ParFORM the FORM programs
written for the sequential version need not be
modified.

TFORM is a parallel version of FORM based on POSIX
threads and thus is bound to run on a single node.
However, there is less overhead connected to the
parallelization and thus TFORM shows a slightly better
performance than ParFORM.

The main advantage of using a parallel version of
FORM is the reduction of the wall clock time. In fact,
there are a number of calculations where it has been
exploited that a speedup of about 10 can be reached with
16 cores, so that the result was available after about
a month instead of a year. A further advantage of using
TFORM or ParFORM is the fact that the size of the
intermediate results which have to be handled by an
individual CPU is smaller, since the workload is
distributed among several workers. This advantage becomes
particularly evident when using ParFORM on a cluster.
In that case the intermediate expressions are stored in
files which are located on different nodes.
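
The quoted numbers can be put into perspective with a simple model.
The Python sketch below uses Amdahl's law, S = 1/(f + (1-f)/N), to
estimate which serial fraction f would be compatible with a speedup of
about 10 on 16 cores. Amdahl's law is an assumption introduced here
for illustration only; the article does not claim that FORM follows
it, and real speedup curves also depend on sorting and I/O.

```python
def serial_fraction(speedup, cores):
    """Serial fraction f implied by Amdahl's law for a given speedup."""
    return (cores / speedup - 1.0) / (cores - 1.0)


def amdahl_speedup(fraction, cores):
    """Speedup predicted by Amdahl's law for serial fraction `fraction`."""
    return 1.0 / (fraction + (1.0 - fraction) / cores)


f = serial_fraction(10.0, 16)      # numbers quoted in the text
print("serial fraction:", f)
print("check, speedup on 16 cores:", amdahl_speedup(f, 16))
print("asymptotic speedup limit 1/f:", 1.0 / f)
print("wall clock time for a 12-month job:", 12.0 / 10.0, "months")
```

In this model a serial fraction of about 4% would already limit the
speedup to roughly 25, independent of the number of cores, which
illustrates why the flattening of the speedup curves is difficult to
avoid.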

To obtain an even better speedup behaviour it would
be necessary to improve the slope of the speedup curves
and to push the flattening to a higher number of
processors. One starting point which could help to
improve the situation is the sorting procedure. Another
idea might be the combination of ParFORM and TFORM,
which could be an ideal tool for a cluster with multi-core
nodes.

In this article we also describe the programs FIRE and
FIESTA. FIRE can be used for the reduction of integrals
belonging to a given integral family to master integrals.
FIESTA, on the other hand, is a user-friendly tool to
numerically compute the coefficients of the $\epsilon$
expansion of multi-loop integrals.